Briefings in Bioinformatics — Latest Matching Preprints

1

Heterogeneity-driven adaptive scale graph learning for subcellular spatial transcriptomics

Shi, W.; Shen, C.; Liu, Y.; Xiao, Q.; Luo, J.

2026-05-21 bioinformatics 10.64898/2026.05.19.726162 medRxiv

Top 0.1%

14.9%

Motivation: Spatial transcriptomics enables gene expression profiling within intact tissue sections, providing an important basis for analyzing tissue organization, cellular heterogeneity, and microenvironmental interactions. However, existing spatial structure identification methods often integrate spatial information using fixed neighborhoods or predefined smoothing scales, which limits their ability to adapt to region-specific structural heterogeneity. In homogeneous regions, broader spatial smoothing can help preserve continuous tissue structures, whereas in regions with complex boundaries or mixed cell populations, excessive smoothing may obscure local expression differences and fine-scale structural changes. Therefore, it is necessary to develop an adaptive graph learning framework that can adjust the range of spatial information integration according to tissue structural heterogeneity.

Results: In this study, we propose HAST, a heterogeneity-driven adaptive-scale graph learning framework for spatial transcriptomics. HAST adaptively determines graph filtering scales according to spatial structural heterogeneity, enabling flexible information aggregation across different tissue regions. It further decomposes gene expression signals into low-frequency structural components and high-frequency residual components, thereby jointly modeling global spatial continuity and local expression variations. Experiments on high-resolution spatial transcriptomics datasets show that HAST improves spatial structure identification and cross-section generalization. Tumor-enriched cluster identification and neighborhood enrichment analysis further demonstrate its ability to characterize tumor-associated spatial regions and microenvironmental organization.
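The idea of splitting a signal into a low-frequency component and a high-frequency residual, with a per-node smoothing scale, can be illustrated with a toy sketch. This is not the HAST implementation: the neighbour-averaging filter, the heterogeneity measure, the median-style cutoff, and all function names below are assumptions chosen only for illustration.

```python
def smooth_once(values, neighbors):
    """One low-pass step: average each node with its neighbours."""
    return [
        (values[i] + sum(values[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in range(len(values))
    ]


def local_heterogeneity(values, neighbors):
    """Mean absolute difference between a node and its neighbours."""
    return [
        sum(abs(values[i] - values[j]) for j in neighbors[i]) / max(len(neighbors[i]), 1)
        for i in range(len(values))
    ]


def adaptive_decompose(values, neighbors, max_steps=3):
    """Split a signal into low-frequency and residual parts.

    Hypothetical rule: nodes in heterogeneous neighbourhoods are smoothed
    for one step only, all other nodes for `max_steps` steps.
    """
    het = local_heterogeneity(values, neighbors)
    cutoff = sorted(het)[len(het) // 2]  # illustrative median-like cutoff
    steps = [1 if h > cutoff else max_steps for h in het]

    # smoothed[k][i] is the value of node i after k smoothing steps.
    smoothed = [list(values)]
    for _ in range(max_steps):
        smoothed.append(smooth_once(smoothed[-1], neighbors))

    low = [smoothed[steps[i]][i] for i in range(len(values))]
    high = [v - l for v, l in zip(values, low)]  # residual component
    return low, high, steps
```

On a six-node path graph with a sharp boundary in the middle, the two boundary nodes receive the small scale, so the step between the two regions is blurred less than it would be under uniform smoothing.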

2

TRACE: a graph-based workflow for TCR-epitope prioritization and tumor-reactive T-cell identification

Chen, Y.; Giuliano, V.; Dacillo, I.; Lin, W.; Yan, Y.; Luo, P.

2026-05-31 bioinformatics 10.64898/2026.05.27.728217 medRxiv

Top 0.2%

14.3%

Accurate prioritization of T-cell receptor (TCR)-epitope interactions and identification of tumor-reactive T cells are important but difficult steps in immunotherapy-oriented bioinformatics workflows. Existing methods typically address these tasks separately and either model TCR-epitope pairs as independent observations or rely primarily on transcriptomic signatures. In this study, we present TRACE (TCR-epitope pRioritization And T-Cell idEntification), a graph-based computational workflow that unifies both applications within a single heterogeneous graph framework. The protocol represents TCRs, epitopes, and T cells as typed nodes connected by similarity and association edges, and combines pretrained sequence embeddings with edge-aware graph attention, Laplacian positional encoding, and bidirectional cross-domain attention. Applied to the IEDB and VDJdb benchmarks, TRACE achieved AUROC/AUPR values of 0.937/0.922 and 0.992/0.990, respectively, outperforming five state-of-the-art algorithms. In addition, on a single-cell RNA-seq dataset, the workflow achieved an AUROC of 0.984 and an AUPR of 0.984, substantially exceeding transcriptomic signature-based baselines for tumor-reactive T-cell identification. Ablation analysis showed that Laplacian positional encoding provided the largest performance gain, particularly in sparse graph settings. These results suggest that heterogeneous graph modeling can serve as a practical protocol for integrating receptor sequence, antigen context, and cellular phenotype in computational immunology.

3

Benchmarking long-context genome language models on biosynthetic gene clusters

Hirota, K.; Higashi, K.; Kurokawa, K.; Yamada, T.

2026-05-15 bioinformatics 10.64898/2026.05.12.724296 medRxiv

Top 0.3%

12.0%

Recent advances in language models for natural language processing have spread to the field of genomics, driving the development of genome language models (gLMs) to decipher genomic information. Cutting-edge long-context gLMs are promising approaches for understanding and designing biological complexity, but their evaluation remains underdeveloped. In this study, we introduce BGCs-Bench, a unified benchmark focused on biosynthetic gene clusters for assessing long-range genomic modeling on three downstream tasks: biosynthetic class prediction, taxonomic classification and coding sequence annotation. Using BGCs-Bench, we perform systematic and layer-wise evaluations of the embedding representations of long-context gLMs, demonstrating that layer selection is crucial for downstream task performance. In addition to the evaluation results, the logit lens analysis of autoregressive gLMs suggests that in StripedHyena-based models, earlier layers encode biologically meaningful information from input DNA sequences, whereas deeper layers optimize embeddings for sequence generation. These findings provide insights for more effective development and application of long-context gLMs.

4

Antimicrobial peptide databases and prediction tools: Toward a standard evaluation framework

Cisterna Garcia, A.; Gonzalez Lopez, A. M.; Vozi, A.; Esteban, M. A.; Egli, A.; Jutzeler, C.; Palma, J.; Sanchez-Ferrer, A.; Botia, J. A.

2026-05-21 bioinformatics 10.64898/2026.05.19.726290 medRxiv

Top 0.3%

10.4%

Antimicrobial resistance (AMR) has a profound impact on animal and human health and is associated with substantial morbidity, mortality and public health costs. There is a clear need to develop novel, effective antibiotic agents, which can overcome the current AMR crisis. Antimicrobial peptides (AMPs) may offer such a solution and have attracted growing attention for their potential to combat AMR. In parallel, the growing availability of peptide sequences in public databases has stimulated the development of numerous machine learning and deep learning tools to predict antimicrobial activity computationally. However, it remains unclear how reliably these tools can be compared, as existing studies often rely on heterogeneous datasets and inconsistent evaluation protocols that may lead to data leakage and inflated performance estimates. This raises a central question: what evaluation criteria and benchmark resources are needed to enable fair, reproducible, and biologically meaningful assessment of AMP prediction tools? We address this question by focusing specifically on antibacterial peptides (ABPs). We first provide an overview of AMP databases relevant to antibacterial activity and compare their content, redundancy, and experimental metadata. We then critically assess existing computational tools for ABP prediction, highlighting key limitations related to dataset construction, affinity to certain sequences, data leakage, and inconsistent performance reporting. Based on these limitations, we propose a reference evaluation framework designed to improve comparability, reproducibility, and practical utility in ABP prediction. Finally, we provide targeted recommendations for AMP databases and future tool development to support more robust progress in the computational discovery of ABPs.

5

The Paipu framework enables creation of a large-scale mammalian cancer transcriptomics atlas

Smith, B. S.; Smith, L. A.; Lee, J.-H.; Cahill, J. A.; Graim, K.

2026-05-18 bioinformatics 10.64898/2026.05.14.725161 medRxiv

Top 0.4%

9.9%

A plethora of studies have identified shared molecular mechanisms involved in tumor development across humans and other mammalian species. While these two-species analyses advance understanding of human disease, extending them across many species would provide evolutionary insight into molecular mechanisms driving human cancers. However, this expansion requires knowledge transfer and harmonization across species. Genomic differences between species, including variation in genome annotation quality, have historically hindered multi-species large-scale atlas creation. To overcome these challenges, we present Paipu, a comprehensive pipeline designed to streamline querying, preprocessing, harmonization, and retrieval of large-scale RNA-seq data and associated metadata from the NCBI Sequence Read Archive (SRA). Paipu facilitates multi-species analysis by creating a harmonized atlas from user-defined search terms and species. It consists of three components: reference genome preparation, SRA metadata retrieval, and RNA-seq data processing. We apply Paipu to 188 cancer-related terms in 239 non-human mammalian species, creating a harmonized atlas of 3,484 RNA-seq samples spanning 17 species and 35 cancers. This pan-mammalian pan-cancer atlas enables myriad comparative genomics analyses that leverage genetic variation to better understand rare human cancers. As such, Paipu serves as a resource for cross-species cancer genomics and supports atlas creation for any set of species and search terms.

6

Assessing and Optimizing Low-Frequency Somatic Mutation Detection: A Multi-Platform High-Throughput Sequencing Perspective

Feng, B. N.; Lin, Y.; Liu, L.; Lin, Q.; Lin, Y.; Liu, Y.; Li, J.; Lei, C.; Chen, C.; Yang, M.; Peng, X.; Zhou, Z.; Yan, Q.; Sun, L.; Li, Q.

2026-06-01 bioinformatics 10.64898/2026.05.28.728367 medRxiv

Top 0.5%

9.0%

The availability of multiple commercial short-read sequencing platforms necessitates systematic cross-platform performance comparisons, particularly for challenging applications such as low-frequency somatic mutation detection. Here, a large-scale targeted sequencing dataset from five Genome in a Bottle (GIAB) human genomic DNA reference standards, HG001 to HG005, alongside Twist Biosciences cfDNA reference standards featuring 1% variant allele frequency (VAF), was generated by six platforms (NovaSeq 6000, NovaSeq X, FASTASeq 300, GenoLab M, SURFSeq 5000, and MGISEQ-T7). To build a realistic benchmark while keeping authentic sequencing backgrounds, we developed PosMix, a simulation tool that generates position-specific VAFs. To overcome the limitations of conventional variant callers (high recall with poor precision for VarScan2, higher precision with lower recall for Strelka2/Mutect2), we developed SomaticXGB, a machine learning-based caller. In this study, SURFSeq 5000 consistently exhibited the lowest error rates and achieved superior accuracy for VAFs as low as 0.5%, outperforming all other sequencing platforms. Meanwhile, SomaticXGB attained F1 scores of approximately 0.92 on simulated datasets with VAFs ranging from 0.5% to 1.5% and 0.89 on Twist 1% standards, substantially outperforming conventional methods. This work delivers a valuable, rich multi-platform data resource, offering a standardized pipeline for performance benchmarking and a machine learning-based strategy for optimized somatic mutation detection.

7

Entropy Fusion DNA: Alignment-Free Gene Fusion Detection through Entropy and Mutual Information Descriptors

Benevento, G.; Malandrino, D.; Ture, A.; Zaccagnino, R.

2026-05-30 bioinformatics 10.64898/2026.05.27.728176 medRxiv

Top 0.5%

8.5%

Gene fusions are clinically relevant genomic alterations and key cancer biomarkers. Their computational detection remains dominated by alignment-based pipelines, whose reliance on read mapping, reference annotations, and heuristic filtering makes them sensitive to mapping ambiguities, annotation incompleteness, repetitive regions, and false positives. Recent machine learning (ML) strategies aim to learn fusion-related patterns directly from sequencing data, but their adoption is still limited by dataset-specific biases, synthetic data artifacts, class imbalance, and representations that may overlook the structural organization of biological sequences. Theoretical and statistical sequence descriptors remain underexplored as efficient tools for capturing informative structural signals in biological reads. In this work, we investigate whether fusion-related information can be inferred directly from the statistical organization of DNA sequences. Each sequence is encoded into a compact, interpretable, and alignment-free feature space combining Shannon and Renyi entropy, lagged and base-resolved mutual information, GC content, and rarefied k-mer richness descriptors. Our goal is to assess whether these information-theoretic features encode discriminative sequence signatures associated with fusion events. For discriminating fusion-derived from non-fusion sequences, nested cross-validation selected K-nearest neighbors as the most effective classifier, achieving strong held-out performance on the balanced benchmark (AUROC = 0.892, AUPRC = 0.865). The same representation was then evaluated on fusion-positive samples for fusion partner prediction and breakpoint localization, achieving strong top-k partner identification accuracy and stable breakpoint regression performance. 
Moreover, a two-stage strategy in which the binary classifier first filters candidate reads further improved partner prediction, suggesting its use as an enrichment step for downstream fusion characterization. Although performance decreased under repeated fusion-pair-disjoint evaluation, it remained clearly above random expectation, supporting the transferability of the proposed descriptors to unseen fusion pairs. Breakpoint-centered validation further revealed increased local sequence complexity, altered short-range dependency structure, and modest but significant microhomology enrichment around fusion regions. Such findings support an interpretable alignment-free framework where information-theoretic features provide predictive and biologically informative signals for gene fusion analysis. The framework is available at: https://github.com/FLaTNNBio/EntropyFusionDNA

Highlights: Alignment-free information-theoretic DNA descriptors detect gene fusions. Resolved mutual-information features provide the strongest predictive signal. Two-stage screening enriches partner-gene prediction and breakpoint analysis.
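Several of the descriptor families named in this abstract have standard textbook definitions, which the sketch below computes for a single DNA string. It is a minimal illustration under stated assumptions, not the authors' feature extractor: the function names, the choice of base-2 logarithms, the default Rényi order of 2, and the default lag of 1 are all hypothetical, and rarefied k-mer richness is not covered.

```python
import math
from collections import Counter


def kmer_probabilities(seq, k=1):
    """Empirical frequencies of overlapping k-mers in a DNA string."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def renyi_entropy(probs, alpha=2.0):
    """Renyi entropy of order alpha in bits; alpha = 1 falls back to Shannon."""
    if alpha == 1.0:
        return shannon_entropy(probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def gc_content(seq):
    """Fraction of G and C bases."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def lagged_mutual_information(seq, lag=1):
    """Mutual information (bits) between bases `lag` positions apart."""
    seq = seq.upper()
    pairs = Counter((seq[i], seq[i + lag]) for i in range(len(seq) - lag))
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c
        right[b] += c
    # p(a,b) / (p(a) * p(b)) simplifies to c * n / (left[a] * right[b]).
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in pairs.items()
    )
```

A feature vector for a read could then concatenate these values over a few choices of `k`, `alpha`, and `lag` before being passed to a classifier such as K-nearest neighbors.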

8

GraphTox: A Semi-Supervised Pre-Trained Framework for Peptide Toxicity Prediction using Geometric Graph Transformer and LORA-Based Finetuning

Bhaduri, S.; Das, D.; Mitra, P.

2026-05-27 bioinformatics 10.64898/2026.05.23.727225 medRxiv

Top 0.5%

8.5%

Peptides are widely used as potential therapeutic agents in drug discovery and biotechnology because they are specific, effective, and relatively inexpensive to produce. They are used in drug development, vaccines, and antimicrobial treatments. However, peptide toxicity remains a major concern because it can lead to adverse effects such as membrane rupture, haemolysis, tissue damage and adverse immunological responses. Early detection of toxic peptide candidates is vital for the development of safe and effective therapies. Current computational methods for predicting peptide toxicity are largely based on hand-crafted sequence descriptors or sequence-only deep learning architectures that may not fully account for the underlying 3-dimensional structural determinants of peptide toxicity. We introduce GraphTox, a structure-aware geometric deep learning framework which combines self-supervised graph representation learning with hierarchical structural modelling to accurately predict peptide toxicity. Our framework learns geometry-aware embeddings from peptide structural graphs via self-supervised masked residue reconstruction, based on a Masked Graph Autoencoder (MGAE) built on a Geometric Graph Transformer (GGT) encoder. The pretrained structural representations are cross-fused via a multi-scale U-Net architecture to capture both local residue-level interactions and global conformational patterns associated with peptide toxicity. GraphTox explicitly models spatial relationships between residues, thereby efficiently capturing structural aspects that are generally neglected by sequence-based predictors, such as residue clustering, hydrophobic interactions and electrostatic organization. On benchmark datasets, our framework shows superior performance and interpretability over existing state-of-the-art methods.
Our hybrid hierarchical structural modelling framework is a superior computational platform to improve the prediction of peptide toxicity and expedite the creation of safer peptide therapies. The code is available at https://github.com/debraj-55555/GraphTox

9

On the benchmarking of clustering algorithms and hyperparameter influence for cell type detection in single-cell RNA sequencing data.

Szmigiel, A.; Gesteira Costa Filho, I.; Campello, R. J. G. B.

2026-05-17 bioinformatics 10.1101/2025.08.20.671270 medRxiv

Top 0.5%

8.4%

Clustering single-cell RNA-seq (scRNA-seq) data and related protocols remains a major challenge due to high dimensionality, sparsity, and noise. Despite numerous benchmarking studies aiming to identify the most suitable clustering methods, many suffer from methodological flaws that can undermine their conclusions. A major challenge in benchmarking is selecting representative datasets that cover the diversity of scRNA-seq experiments and include laboratory-verified labels for reliable evaluation. Consistent preprocessing of all inputs to benchmarked algorithms is crucial, as it significantly impacts performance. Beyond selecting an algorithm, a thorough exploration of hyperparameters is also essential to assess robustness and identify configurations that maximize performance. We focus on proposing an improved benchmarking framework that addresses common methodological issues in prior studies. We illustrate our proposed methodology in a case study comparing the classic Leiden and Louvain clustering algorithms with extensive hyperparameter exploration on a carefully curated collection of real gold standard datasets. By evaluating clustering performance across different hyperparameter selection scenarios, we show that benchmarking results can be misleading, either overestimating or underestimating performance depending on how the hyperparameter space is explored. In our illustrative case study, benchmarking results do not reveal any practically relevant performance differences between the Louvain and Leiden algorithms. In contrast, we show that overlooked factors such as graph construction and quality functions critically influence clustering outcomes, particularly under suboptimal settings of numerical hyperparameters: the neighborhood size k used for similarity graph construction and the resolution hyperparameter in graph-based clustering algorithms.
While noticeable trends have been observed in terms of how different (dis)similarity functions affect performance, the impact of this choice is limited and, to some extent, overridden by the graph-building approach. Across different graphs, there is a noticeable trade-off between achieving optimal performance with ideally tuned numerical hyperparameters and maintaining robustness under more realistic, unsupervised, and suboptimal settings. All in all, the analysis of our illustrative benchmarking case study offers clear guidance and objective recommendations for practitioners in the field. Most importantly, as the main contribution of this manuscript, our proposed framework sets a foundation for more reliable scRNA-seq clustering evaluation and benchmarking in future studies.

10

MSLipidMapper: a pathway-centered lipidome analysis environment linking lipid class, acyl-chain subsets, and multi-omics data

Oka, T.; Nishida, K.; Harayama, T.; Tsugawa, H.

2026-05-25 bioinformatics 10.64898/2026.05.21.726751 medRxiv

Top 0.6%

8.2%

Lipids exhibit extensive structural diversity arising from variation in lipid classes, subclasses, and acyl-chain compositions, making systematic interpretation of lipidomics data challenging. Although untargeted lipidomics enables the quantification of hundreds to thousands of lipid molecular species, downstream analyses often treat pathway-level summaries, molecular-species visualization, structural subsetting, and multi-omics interpretation as separate steps. Here, we present MSLipidMapper, an R/Shiny-based lipidomics data exploration environment for pathway-centered and structure-aware analysis of annotated lipidomics datasets. MSLipidMapper reconstructs annotated lipid peak tables as Bioconductor SummarizedExperiment objects, thereby organizing quantitative lipid abundance values, sample metadata, lipid subclass annotations, and parsed acyl-chain features within a unified data structure. Lipid molecular species are summarized on static, curated lipid metabolic pathway maps at the subclass level while retaining direct links to the underlying molecular species and acyl-chain annotations. This design enables users to inspect molecular-species patterns underlying each pathway node, define lipid subsets based on structural features such as specific acyl chains, and re-project these subsets onto the same pathway context. Gene or protein expression data can also be overlaid on pathway-associated reactions to support multi-layer interpretation of lipid metabolism. The program is showcased using publicly available aging lipidome datasets of mice, illustrating how subclass-level pathway summaries can be connected to molecular-species heatmaps, acyl-chain-defined subsets, and transcriptome or proteome information.

11

simCRISPR: Modeling experimental complexity in pooled CRISPR screens

Zhu, Z.; Dong, X.; Kim, C. H.; Maugee, C.; Barbazuk, W. B.; Vulpe, C.; Bacher, R.

2026-05-15 bioinformatics 10.64898/2026.05.14.725042 medRxiv

Top 0.7%

6.9%

Pooled CRISPR screens are widely used to investigate gene function and uncover genetic interactions. However, benchmarking computational methods for detecting gene-by-environment (GxE) interactions remains difficult because ground truth is rarely available and existing simulation tools are not designed for GxE screening contexts. To address this, we developed simCRISPR, a flexible simulation framework for generating pooled CRISPR screen data under complex experimental designs. Using simulated datasets informed by empirical CRISPR screen designs, we evaluated commonly used analysis methods, comparing normalization strategies based on safe-harbor versus non-targeting sgRNAs and assessing empirical log2FC thresholds as an additional effect-size criterion. We found that safe-harbor-based normalization improved interaction detection when DNA damage-related effects were present, particularly when combined with empirical log2FC thresholding for DESeq2. Application of this workflow to a doxorubicin GxE screen further showed that safe-harbor-based normalization reduced bias in log2FC distributions and identified additional biologically relevant candidates. simCRISPR is available at https://github.com/bachergroup/simCRISPR.
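The combination of control-guide normalization and an empirical log2FC threshold can be sketched as follows. This is an illustration under explicit assumptions, not simCRISPR or the authors' workflow: the median-of-controls size factor, the pseudocount, the rule that derives the threshold from the spread of control-guide log2FC values, and every function and guide name are hypothetical.

```python
import math
from statistics import median


def control_median(counts, control_ids):
    """Size factor for one sample: median count over the control guides."""
    return median(counts[g] for g in control_ids)


def normalized_log2fc(treated, untreated, control_ids, pseudocount=1.0):
    """Per-guide log2 fold change after scaling each sample by its controls."""
    sf_t = control_median(treated, control_ids)
    sf_u = control_median(untreated, control_ids)
    return {
        g: math.log2((treated[g] + pseudocount) / sf_t)
        - math.log2((untreated[g] + pseudocount) / sf_u)
        for g in treated
    }


def call_hits(log2fc, control_ids, margin=2.0):
    """Flag non-control guides whose |log2FC| exceeds an empirical threshold.

    Hypothetical rule: threshold = margin * largest |log2FC| among controls.
    """
    null_spread = max(abs(log2fc[g]) for g in control_ids)
    threshold = margin * null_spread
    return sorted(
        g for g, v in log2fc.items()
        if g not in control_ids and abs(v) > threshold
    )
```

Passing safe-harbor guide IDs or non-targeting guide IDs as `control_ids` is the only difference between the two normalization strategies in this sketch, which makes it easy to compare how the choice of control set shifts the resulting log2FC distribution.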

12

DamageFormer: a damage-aware multimodal deep learning framework for DNA lesion identification from nanopore sequencing

Yang, Q.; Li, L.; Ma, Q.; Yin, R.

2026-05-18 genomics 10.64898/2026.05.14.725245 medRxiv

Top 0.7%

6.8%

Background: DNA lesions arise from endogenous metabolism and environmental exposure and are the major drivers of mutagenesis, aging, and cancer development. However, mapping DNA damage at nucleotide resolution remains a technically challenging task. Nanopore sequencing enables direct detection of chemical perturbations through alterations in ionic current signals. Despite this potential, existing computational approaches remain limited in their capacity to generalize across diverse lesion types and to effectively integrate nucleotide sequence context with raw signal information for accurate detection and localization.

Results: We presented DamageFormer, a multimodal deep learning framework for detection and localization of DNA lesions using native nanopore sequencing data. Central to this framework is LesionBERT, a damage-aware genomic foundation model built upon DNABERT-2 and enhanced with lesion-focused reconstruction objectives to improve representation of chemically modified bases. DamageFormer integrated LesionBERT with a neural signal model through an adaptive gating mechanism, enabling dynamic weighting of sequence context and nanopore signal evidence. The model was trained using a joint objective that combines prediction, localization, and contrastive alignment losses to promote cross-modal coherence and spatial precision. On an oxidative DNA damage benchmark comprising paired sequence and signal data, DamageFormer achieved an AUROC of 0.99997 for lesion detection and a mean absolute localization error of 0.00439, consistently outperforming state-of-the-art baselines. Model interpretation analyses revealed context-dependent modality weighting that adapts to variation in signal quality and sequence ambiguity. The proposed framework further generalizes to chemically distinct guanine lesions not observed during the training process, demonstrating its robustness and transferability to unseen damage types.

Conclusions: Damage-aware biological language modeling combined with adaptive multimodal fusion enables accurate and interpretable identification of DNA lesions from nanopore sequencing data. This framework provides a scalable approach for characterizing genome-wide damage landscapes and illustrates how chemical DNA information can be systematically incorporated into genomic language models. The source code and pretrained models of this work are available at: https://github.com/UF-HOBIYin-Lab/DamageFormer.

13

ARACoFusion: Uncertainty-aware calibrated deep learning for protein-protein interaction network prediction in Arabidopsis thaliana

Sarkar, D.; Sarkar, C.

2026-05-26 bioinformatics 10.64898/2026.05.22.727120 medRxiv

Top 0.8%

6.4%

Accurate mapping of the Arabidopsis thaliana protein-protein interaction (PPI) network is essential for deciphering the complexity of plant systems biology. Here, we present ARACoFusion, a specialized deep learning architecture designed to predict inter-protein connectivity directly from primary sequences. To capture the asymmetric dependencies between plant proteins, the framework utilizes a reciprocal cross-attention encoder combined with latent interaction projections and multi-source feature fusion. Addressing the severe class imbalance inherent in plant interactomes, the model integrates uncertainty-aware variance regularization and focal loss with label smoothing, further enhancing reliability through post-hoc probability calibration via temperature scaling. Extensive benchmarking on gold-standard Arabidopsis datasets demonstrates that ARACoFusion significantly outperforms existing plant-specific predictors, achieving superior scores in Area Under the Precision-Recall Curve (AUPRC), Balanced Accuracy, and Matthews Correlation Coefficient (MCC). Additionally, the model exhibits robust cross-species generalization and clear class separability in t-SNE latent space visualizations. To facilitate community-wide usage, we provide a dedicated web server for scalable network-level inference at https://ARAcofusion.compbiosysnbu.in/.
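Two of the ingredients named above, focal loss with label smoothing and post-hoc temperature scaling, have widely used generic forms, sketched below for the binary case. This is not the ARACoFusion code: the function names, the default `gamma` and `smoothing` values, and the grid search over temperatures (a simple stand-in for the gradient-based fitting that is more common in practice) are assumptions, and the uncertainty-aware variance regularization is not covered.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(logit, label, gamma=2.0, smoothing=0.1):
    """Binary focal loss against a label-smoothed target."""
    p = sigmoid(logit)
    target = label * (1.0 - smoothing) + 0.5 * smoothing
    eps = 1e-12
    # (1 - p)^gamma and p^gamma down-weight examples that are already easy.
    loss_pos = -target * (1.0 - p) ** gamma * math.log(p + eps)
    loss_neg = -(1.0 - target) * p ** gamma * math.log(1.0 - p + eps)
    return loss_pos + loss_neg


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of labels under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimises validation NLL."""
    grid = grid or [0.25 + 0.05 * i for i in range(96)]  # 0.25 to 5.0
    return min(grid, key=lambda t: nll(logits, labels, t))
```

The temperature is fitted on held-out validation logits after training and leaves the ranking of predictions unchanged, so ranking metrics such as AUPRC are unaffected while the reported probabilities become less overconfident.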

14

DMPKformer: An Interpretable Multimodal Deep Learning Framework for Reliable ADMET Property Prediction

A. S., B. G.; Singh, A.; Kanchan, S.; Anapat, S.; Gurram, K.; Kulkarni, N. M.

2026-05-29 bioinformatics 10.64898/2026.05.28.728612 medRxiv

Top 0.8%

6.3%

Accurate prediction of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties remains a critical challenge in drug discovery. Traditional single-modality approaches often fail to capture the complex, multi-scale relationships governing molecular behavior across physicochemical, structural, and pharmacokinetic dimensions. In this work, we propose a multi-modal deep learning framework that integrates complementary molecular representations (MACCS fingerprints, molecular graphs, and physicochemical descriptors) to achieve robust ADMET property prediction. Each modality is modeled using a specialized neural subnetwork tailored to its structural characteristics: a self-attention-based Transformer encoder for MACCS fingerprints, a Graph Attention Network (GAT) for molecular graph representations, and a tanh-activated multilayer perceptron for RDKit-, PaDEL-, and Mordred-derived descriptors. Each modality is independently trained for binary classification, and latent embeddings extracted from internal layers serve as transferable molecular representations. These embeddings are subsequently fused and fine-tuned via a tanh-activated dense network and shared prediction head to form a unified ADMET predictor. The proposed framework achieves competitive performance across multiple TDC ADMET benchmarks while providing enhanced interpretability through modality-specific attention mechanisms. In addition, the incorporation of latent-space out-of-distribution (OOD) confidence estimation enables identification of high-confidence operating regions, improving the reliability and practical applicability of the framework for molecular property prediction in drug discovery workflows.

15

KmerSignificance Score: A discriminative and biologically-informed framework for viral k-mer prioritization

Lebatteux, D.; Corso, F.; Soudeyns, H.; Boucoiran, I.; Gantt, S.; Banire Diallo, A.

2026-05-21 bioinformatics 10.64898/2026.05.15.725339 medRxiv

Top 0.9%

6.3%

Distinguishing closely related viral strains requires identifying genomic regions where subtle sequence differences carry biological significance. While k-mer-based approaches offer computational efficiency for genome analysis, existing methods lack standardized frameworks for evaluating which k-mers are most informative. Current selection strategies focus primarily on statistical discriminative power without integrating biological relevance. We introduce KmerSignificance Score (KSS), a k-mer prioritization framework combining three components: an information-theoretic method measuring strain-distinguishing capacity, an optimized amino acid substitution matrix (MIYATA EVO) for mutation impact assessment, and protein-level functional importance scoring derived from UniProt annotations. KSS produces standardized scores in the [0, 1] interval, enabling direct cross-dataset comparison. The discriminative component achieved classification performance comparable or superior to all tested alternatives (mean F1 = 0.880 vs. 0.718-0.877 for six established methods) while additionally providing bounded scores with consistent empirical distributions for cross-dataset comparability. MIYATA EVO, optimized via genetic algorithm, improved biophysical property correlations by 28.4% over the original MIYATA matrix. Protein scoring on 17,470 viral proteins showed robust agreement with UniProt annotation scores (Kendall τ = 0.777) while revealing finer functional distinctions. Literature validation on SARS-CoV-2 (278,738 sequences, 19 variants), HIV-1 (12,223 sequences, 15 subtypes), and human cytomegalovirus (HCMV; 399-646 sequences, 4-8 genotypes) confirmed that high-scoring k-mers consistently map to established variant-defining mutations, subtype-specific polymorphisms, and genotype markers. KSS provides a standardized framework for viral k-mer prioritization with applications in variant surveillance, molecular epidemiology, and functional annotation.
The tool is available at https://github.com/bioinfoUQAM/KmerSignificanceScore.

Author summary: Identifying genetic differences between closely related viral strains is essential for pandemic preparedness, vaccine development, and understanding disease outbreaks. With millions of viral genomes now sequenced, researchers need tools that can rapidly pinpoint which genomic differences matter most biologically, not just which are statistically distinctive. Current k-mer-based approaches identify patterns that distinguish viral strains but cannot assess whether those differences affect protein function or disease phenotype. We developed KmerSignificance Score (KSS), a framework that we designed to rank short genomic sequences by combining three types of evidence: how well they distinguish viral strains, how much the encoded amino acid changes affect protein properties, and how functionally important the affected protein is. We standardized the resulting scores on a 0-to-1 scale, allowing direct comparison across different viruses and studies. We validated our framework on three major human pathogens (SARS-CoV-2, HIV-1, and human cytomegalovirus) and found that top-scoring positions consistently correspond to sites with documented roles in immune evasion, drug resistance, viral fitness, and strain classification. Our framework can help prioritize genomic features for surveillance of emerging variants, guide experimental validation, and support molecular epidemiology.
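The abstract does not state which information-theoretic measure KSS uses for its discriminative component. As a rough illustration only, the sketch below scores a k-mer by the mutual information between its presence and the strain label, divided by the label entropy, which keeps the result in [0, 1]. The function names and the toy sequences are hypothetical and are not taken from the KSS repository.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (bits) of a collection of category counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def kmer_discrimination(sequences, labels, kmer):
    """Normalized mutual information between k-mer presence and strain label.

    Illustrative stand-in, NOT the published KSS formula.
    Returns 0.0 when presence says nothing about the label and 1.0 when
    presence determines the label completely.
    """
    presence = [kmer in seq for seq in sequences]
    h_label = entropy(Counter(labels).values())
    if h_label == 0:
        return 0.0  # only one strain: nothing to discriminate
    n = len(sequences)
    h_conditional = 0.0
    for state in (True, False):
        subset = [lab for lab, p in zip(labels, presence) if p == state]
        if subset:
            h_conditional += len(subset) / n * entropy(Counter(subset).values())
    return (h_label - h_conditional) / h_label


# Toy data (made up): two sequences per strain.
sequences = ["ACGTAC", "ACGTTT", "GGGTAC", "GGGTTT"]
labels = ["A", "A", "B", "B"]
for kmer in ("ACG", "TAC", "GT"):
    print(kmer, round(kmer_discrimination(sequences, labels, kmer), 3))
```

In this toy set, a k-mer found only in strain A scores at the top of the range, while k-mers shared evenly across strains score at the bottom. A full KSS-style score would also need the substitution-matrix and protein-importance components, which the abstract describes but does not specify in enough detail to reproduce here.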

16

Learning from Drops: AI-Guided Integration of Liquid Biopsy Features in Cancer Studies

Andueza, M.; Villoslada-Blanco, P.; De Dreuille, B.; Alonso, L.; Sabroso-Lasa, S.; Pantel, K.; Alix-Panabieres, C.; Lopez de Maturana, E.; Malats, N.

2026-05-17 bioinformatics 10.64898/2026.05.12.724535 medRxiv

Top 0.9%

6.3%


Cancer is a major global health issue with rising incidence and mortality. Early detection, tumor characterization, and disease surveillance are crucial for timely and effective treatment, ultimately reducing mortality rates. Liquid biopsy (LB) has emerged as a valuable detection tool offering a non-invasive method to determine tumor-derived biomarkers in body fluids with demonstrated translational potential. To increase biomarker sensitivity, high-throughput sequencing platforms deliver massive volumes of data. Artificial Intelligence (AI) is pivotal in enabling the integration of large and complex data. This contribution aims to assess the current state of integrative AI-based research in the LB field and provide methodological guidance. First, we conducted a PubMed search and found that few studies integrate LB features, and fewer still do so by applying AI. When adopting the latter approach, defining the study objectives is crucial to guide the subsequent methodological aspects, including study design, patient selection criteria, sample size, nature of the LB features, and metadata to collect. Specifically, we propose strategies and tools for data preprocessing, including normalization and batch correction, as well as handling outliers and missing data. Furthermore, we recommend various machine learning and deep learning approaches to feature selection to ensure model robustness, and we highlight the importance of rigorous internal and external validation of the selected models. Assessing clinical utility and interpretability is often overlooked but fundamental for real-world implementation. In conclusion, we provide the LB scientific community with AI-based methodological guidance to bridge the two fields and enhance the integrative analysis of LB features.

Graphical abstract: Workchart for multiomics integrative studies in the liquid biopsy field.
Note: CTCs, circulating tumor cells; ctDNA, circulating tumor-DNA; TEPs, tumor-educated platelets; miRNA, microRNA; cfRNAs, cell-free RNAs.

17

Mechanistic Interpretability for Protein Language Models: A Validation Framework

Chon, P.; ANDREOPOULOS, W. B.

2026-06-02 bioinformatics 10.64898/2026.05.29.727021 medRxiv

Top 0.9%

6.3%


Protein language models (PLMs) have been shown to be powerful predictors of protein structure and function, but their internal mechanisms remain poorly understood. Recent mechanistic interpretability methods have decomposed PLM representations into interpretable features, but they have not been combined on a single biologically meaningful task. This paper tests whether an InterPLM sparse autoencoder and a ProtoMech cross-layer transcoder can discover features in ESM-2 (6 layers, 8M) that discriminate between Class A β-lactamase and Class B β-lactamase, with Classes C and D used as more challenging comparisons. The main goal is to find distinct features for Class A β-lactamase that are not shared by other classes. We find that both methods identify distinct features for Class A β-lactamase, but the cross-layer transcoders show that the concepts for Class A β-lactamase seem to be distributed among nodes, such as those in layers 4 and 6, rather than concentrated in one node. We also showcase a validation framework to prevent overclaiming the role of a node, and we use it to show that several strong nodes fail some stages of the framework, meaning that they cannot be the sole node that defines Class A β-lactamase.

18

Selective scoring of drug effects in multicellular co-culture systems

Dias, D.; Ianevski, A.; Bouhlal, J.; Ciboddo, M.; Nygren, P.; Klievink, J.; Lahteenmaki, H.; Dufva, O.; Mustjoki, S.; Aittokallio, T.

2026-05-24 bioinformatics 10.64898/2026.05.20.726737 medRxiv

Top 0.9%

6.2%


Multicellular co-culture screening reveals compound effects that depend on cell-cell interactions. Standard dose-response metrics fail to resolve effects that arise either from target-effector cell interactions or from non-specific toxic effects. Here, we developed the Co-culture Efficacy Score (CES), a robust computational framework that enables systematic identification of compounds that selectively modulate cellular interactions in multicellular assays. The CES framework supports both therapeutic scoring, which penalizes direct effector cell toxicity, and mechanistic discovery, which estimates immunomodulatory effects by adjusting for effector cell responses. When screening 527 compounds across 10 hematological cancer models co-cultured with natural killer (NK) cells, CES distinguished co-culture-specific immunomodulatory effects from NK cell toxicity and cancer cell inhibitory responses, recovering systematic enhancer and inhibitory patterns. We further assessed CES robustness using higher-resolution validation screens and demonstrated its applicability to identify selective compounds in anti-CD19 CAR T-cell and antiviral host-pathogen screens. To facilitate its broad use, we implemented CES as an interactive web application for quantitative analysis of compound responses in co-culture assays, providing a widely applicable scoring framework for cancer immunotherapy, antiviral screening and drug discovery.

19

Sequence-Based Prioritization of Promoter Regulatory Variants in Colorectal Cancer Using a DNA Foundation Model

Shome, S.; Vajinepalli, S.; Saraf, A.

2026-05-28 bioinformatics 10.64898/2026.05.25.727528 medRxiv

Top 1.0%

6.2%


Noncoding regulatory variants contribute to colorectal cancer (CRC) susceptibility, yet their functional interpretation remains difficult. This is mainly because regulatory effects are context-dependent and most noncoding regions lack reliable genomic annotations. We have developed a computational framework that aids in prioritizing promoter-associated variants using Evo2, a large-scale autoregressive DNA foundation model. In the framework, variants were mapped to promoter regions (±1,024 bp) across ~1,250 CRC-associated genes and scored using Evo2-derived delta scores, defined as the difference in sequence probability between reference and alternate alleles. Promoter variants showed greater predicted regulatory impact than non-promoter variants (median delta = 0.015 vs. 0.002; overall mean = 0.018, SD = 0.011). Applying a distributional threshold (delta > 0.020; top ~25%) identified 287 high-impact variants across 198 CRC-associated genes. These genes were enriched in CRC-relevant pathways such as Wnt signaling, p53 signaling, and cell cycle regulation, and 36.4% (72/198) overlapped known cancer genes (2.3-fold enrichment, p = 8.7 × 10⁻⁶). Independent validation showed high-impact variants were enriched at CRC GWAS loci and overlapped transcription factor binding sites (~32%) and motif-disrupting positions (~21%), supporting their functional relevance. Together, these results show that sequence-based foundation models can scalably prioritize noncoding regulatory candidates in CRC without supervised training or predefined annotations.
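The thresholding step described in this abstract can be sketched in a few lines. The example below assumes each variant already has a reference and an alternate sequence probability from some model, and it takes the delta score to be their absolute difference; the abstract does not say whether the sign is kept, so that is an assumption. The variant identifiers and probabilities are made up, and no Evo2 API is called.

```python
def delta_score(ref_prob, alt_prob):
    """Absolute difference between reference and alternate allele probabilities.

    Using the absolute value is an assumption; the abstract only says
    'difference in sequence probability'.
    """
    return abs(ref_prob - alt_prob)


def high_impact(variants, threshold=0.020):
    """Return IDs of variants whose delta score exceeds the threshold.

    The 0.020 default is the distributional cutoff quoted in the abstract.
    """
    return [
        variant_id
        for variant_id, (ref_prob, alt_prob) in variants.items()
        if delta_score(ref_prob, alt_prob) > threshold
    ]


# Hypothetical per-variant (reference, alternate) probabilities.
variants = {
    "var_a": (0.310, 0.280),  # delta about 0.030
    "var_b": (0.305, 0.303),  # delta about 0.002
    "var_c": (0.290, 0.315),  # delta about 0.025
}
selected = high_impact(variants)
print(selected)

# Arithmetic check of the reported overlap: 72 of 198 genes.
overlap_pct = round(100 * 72 / 198, 1)
print(overlap_pct)
```

Here the two variants with deltas near 0.030 and 0.025 pass the cutoff, and the overlap calculation reproduces the 36.4% figure given in the abstract.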

20

HyperNiche: Learning Heterophilic Cellular Niches with Hypergraph Neural Networks

Mahmud, M. I.; Banerjee, T.

2026-06-03 bioinformatics 10.64898/2026.05.30.728986 medRxiv

Top 1.0%

6.1%


We propose HyperNiche, a hypergraph-based framework for modeling higher-order, heterogeneous cellular niches from spatial transcriptomics data. Unlike conventional graph-based methods that rely on pairwise similarity and tend to produce homogeneous clusters, HyperNiche learns anchor-centered hyperedges through a compatibility-driven mechanism that captures both homophilic and heterophilic relationships among cells. By decoupling node roles into anchor and member representations and integrating spatial geometry into hyperedge construction, the model enables the discovery of multicellular niches that span diverse cell types. We evaluate HyperNiche on high-plex Xenium spatial transcriptomics datasets from breast and lung cancer tissue microarrays, demonstrating improvements over state-of-the-art graph-based baselines in clustering performance (ARI, NMI) and biological interpretability. Further analysis shows that HyperNiche produces hyperedges with significantly higher intra-edge feature diversity, indicating an enhanced ability to capture heterogeneous cellular niches compared to similarity-based models. These results highlight the importance of higher-order relational modeling for understanding complex spatial tissue organization and tumor microenvironments.
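The abstract reports higher intra-edge feature diversity without defining the metric. One plausible way to quantify it, offered here purely as an assumption, is the Shannon entropy of the cell-type labels among the members of each hyperedge. The function name, cell-type labels, and hyperedges below are hypothetical and do not come from HyperNiche.

```python
from collections import Counter
from math import log2


def hyperedge_diversity(member_cell_types):
    """Shannon entropy (bits) of cell-type labels within one hyperedge.

    Assumed stand-in for 'intra-edge feature diversity':
    0.0 means all members share one cell type (a homophilic niche);
    larger values mean a more mixed (heterophilic) niche.
    """
    counts = Counter(member_cell_types)
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values())


# Made-up hyperedges: each is the list of cell types of its member cells.
homophilic_edge = ["T cell", "T cell", "T cell", "T cell"]
mixed_edge = ["T cell", "B cell", "tumor", "fibroblast"]

print(hyperedge_diversity(homophilic_edge))
print(hyperedge_diversity(mixed_edge))
```

Averaging this value over all hyperedges would give a single number to compare against a similarity-based baseline, where edges that group only like cells score near zero.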